JEFS 99 - Capstone(Python)
Loading...

Spark Logo Tiny Revenue Insights of a Sports Retailer

This is a Capstone project, which is design to validate your learnings from the Just Enough Scala for Spark Course.

This Capstone presents you with an opportunity, where you can apply all your learnings including implmentation of classes, methods, exception handling, collections and functional programming to a large problem.

Spark Logo Tiny Problem Description

A sports retailer computes revenue performance of a company every end of the year.

The revenue related information is available in comma separated value format. All quarterly revenue information is available as a string, where each substring separated by a ; contains information about a specific product category in a quarter (in a financial year).

Each substring contains following information as comma separated values.

  • which quarter of the year (Q1/Q2/Q3/Q4 etc.)
  • name of the product category
  • total revenue generated in the financial quarter.

A sample data string is shown below.


 "Q1-2018   Exercise_Fitness    10.33, 
 Q1-2018   Outdoor_Play_Equipment  7.85, 
 Q1-2018   Winter_Sports   3.45"

Where each substring separated by ; is considered a record. In the subsequent section, all references to record means this substring (e.g. Q1-2018,Exercise_Fitness,10.33).

This capstone asks the participants to develop the module in 8 steps.

  1. Step 1: Initialize the string
  2. Step 2: Parse and Validate a record (substring) to figure out if the substring is properly formatted or not.
  3. Step 3: Define a method to filter good and bad records from the complete string.
  4. Step 4: Define a case class to represent a good record and convert all good records to an array of case classes
  5. Step 5: Create a Class that encapsulates all good records and provides methods to calculate insights
    • the total revenue generated by the sports company
    • which takes a product category and returns the total revenue generated

Spark Logo Tiny Step 1: Initialize the String

The string that contains the data is given below. Store the string in a variable named RevenueInfo.

 "Q1-2018,Exercise_Fitness,10.33; 
  Q1-2018,Outdoor_Play_Equipment,7.85; 
  Q1-2018,Winter_Sports,3.45;
  Q2-2018,Exercise_Fitness,7.63; 
  Q2-2018,Outdoor_Play_Equipment,5.05; 
  Q2-2018,Winter_Sports,-;
  Q3-2018,Exercise_Fitness,1.31; 
  Q3-2018,Outdoor_Play_Equipment,3.95; 
  Q3-2018,Winter_Sports,1.50;
  Q4-2018,Exercise_Fitness,5.71; 
  Q4-2018,Outdoor_Play_Equipment,6.52; 
  Q4-2018,Winter_Sports,4.15"
#ANSWERS
 
RevenueInfo = '''Q1-2018,Exercise_Fitness,10.33; 
                 Q1-2018,Outdoor_Play_Equipment,7.85; 
                 Q1-2018,Winter_Sports,3.45;
                 Q2-2018,Exercise_Fitness,7.63; 
                 Q2-2018,Outdoor_Play_Equipment,5.05; 
                 Q2-2018,Winter_Sports,-;
                 Q3-2018,Exercise_Fitness,1.31; 
                 Q3-2018,Outdoor_Play_Equipment,3.95; 
                 Q3-2018,Winter_Sports,1.50;
                 Q4-2018,Exercise_Fitness,5.71; 
                 Q4-2018,Outdoor_Play_Equipment,6.52; 
                 Q4-2018,Winter_Sports,4.15'''

Spark Logo Tiny Step 2: Parse and Validate a record

In this step, validate a substring that represents a record e.g. "Q1-2018,Exercise_Fitness,10.33"

  1. Define a method validateRecord, which takes a string named recStr (representing a record)
  2. Parse the string by comma
  3. Validate that the string contains 3 fields
    • If there are less than 3 fields then throw an exception
  4. Validate if the thrid field is numeric or not (Note: try type casting to float)
    • If not numeric then throw an exception
  5. Surround the complete implementation by try/except to handle exceptions
  6. If the record is valid (i.e. passes both the checks),

    • then return a tuple with tag (GOOD) and the record string as ("GOOD", recStr)
    • if the record is not valid then return a tuple with tag (BAD) and the record string as ("BAD", recStr)

    Note: recStr is the original string passed to the method

# ANSWERS
 
def validateRecord(recStr):
  
  # Split the string by comma
  fields = recStr.split(",")
 
  # Wrap the code around try/except to handle exceptions
  try:
    # If the number of fields in the string is less than 3, then throw an exception  
    if len(fields) < 3:
      raise  Exception("Expected 3 fields. Found only ", len(fields))
      
    # Convert the third field (revenue) into float type  
    revenue = float(fields[2])
    # If we have reached here without any problem, then it is a good rec. Tag the record GOOD and return
    return ("GOOD", recStr)
    
  # If we have reached here then it is a bad rec. Tag the record BAD and return  
  except Exception as ex: 
    return ("BAD", recStr)
# TEST - Run this cell to test your solution.
 
test1Str = validateRecord("Q1-2018,Exercise_Fitness,10.13")
test1StrExpected = ("GOOD","Q1-2018,Exercise_Fitness,10.13")
 
test2Str = validateRecord("Q1-2018,Exercise_Fitness,as")
test2StrExpected = ("BAD","Q1-2018,Exercise_Fitness,as")
 
test3Str = validateRecord("Q1-2018,Exercise_Fitness")
test3StrExpected = ("BAD","Q1-2018,Exercise_Fitness")
 
assert test1Str == test1StrExpected, "Expected the total to be " + str(test1StrExpected) + " but found " + str(test1Str)
assert test2Str == test2StrExpected, "Expected the total to be " + str(test2StrExpected) + " but found " + str(test2Str)
assert test3Str == test3StrExpected, "Expected the total to be " + str(test3StrExpected) + " but found " + str(test3Str)

Spark Logo Tiny Step 3(a): Define a method to filter good and bad records

In this step parse the string RevenueInfo into substrings and use the method validateRecord to validate each substring.

  1. Define a method called parsetData, which takes the complete string stored in RevenueInfo
  2. Parse the string by a semicolon(;) character, which returns an Array of substrings
  3. Iterate through all the substrings (using functional programming features), validate each substring
  4. Filter all good records that contains the tag GOOD
  5. Filter all bad records that contains the tag BAD
  6. Return a tuple that contains good and bad records as (array of good records, array of bad records)
# ANSWERS
 
def parsetData(dataStr):
  
  # split the string by ';' to get individual record strings
  recList = dataStr.split(";")
  
  # validate each record
  allRecs = list(map(lambda rec: validateRecord(rec.strip()), recList))
 
  # based on GOOD and BAD tag, store in goodRecs and badRecs variable
  badRecs = list(filter(lambda rec: rec[0] == "BAD", allRecs))
  goodRecs = list(filter(lambda rec: rec[0] == "GOOD", allRecs))
  
  # return the tuple
  return (goodRecs, badRecs)

Spark Logo Tiny Step 3(b): Filter good and bad records

  1. Invoke the above method parseData with RevenueInfo as a parameter.
  2. Store the returned tuple values (containing arrays of good and bad records) into two variables called goodrecs and badrecs.
#ANSWERS
 
goodrecs, badrecs = parsetData(RevenueInfo)
# TEST - Run this cell to test your solution.
 
goodrecsLen = len(goodrecs)
goodrecsLenExpected = 11
 
badrecsLen = len(badrecs)
badrecsLenExpected = 1
 
assert goodrecsLen == goodrecsLenExpected, "Expected the total to be " + str(goodrecsLenExpected) + " but found " + str(goodrecsLen)
assert badrecsLen == badrecsLenExpected, "Expected the total to be " + str(badrecsLenExpected) + " but found " + str(badrecsLen)

Spark Logo Tiny Step 4 (a): Define Named Tuple to represent each good record

  1. Define a named tuple CategoryQuarterlyRecord with three fields
    • quarter (String)
    • category (String)
    • revenue (Float)
#ANSWERS
 
from collections import namedtuple
 
CategoryQuarterlyRecord = namedtuple('CategoryQuarterlyRecord', ('quarter', 'category', 'revenue'))

Spark Logo Tiny Step 4(b): Define an utility method to convert a substring into named tuple

  1. Define a method convertToClass, which
    • Takes the string representing a good record ( e.g. "Q1-2018,Exercise_Fitness,10.33")
    • Parses the string by comma
    • Returns a case class of type CategoryQuarterlyRecord
# ANSWERS
 
# Define the method convertToClass()
 
def convertToClass(goodrec):
  
  # split the record by ','
  fields = goodrec.split(",")
 
  # return the case class
  return CategoryQuarterlyRecord(fields[0], fields[1], float(fields[2]))
# TEST - Run this cell to test your solution.
 
caseClass = convertToClass("Q1-2018,Exercise_Fitness,10.33")
caseClassExpected = CategoryQuarterlyRecord("Q1-2018","Exercise_Fitness",10.33)
 
assert caseClass == caseClassExpected, "Expected the total to be " + str(caseClassExpected) + " but found " + str(caseClass)

Spark Logo Tiny Step 4(c): Convert all good records to named tuples

  1. We have all the good record string stored in goodrecs variables.
  2. Convert all the string to an Array of CategoryQuarterlyRecord classes and store in a variable categoryRecs.
# ANSWERS
 
categoryRecs = list(map(lambda rec: convertToClass(rec[1]), goodrecs))
# TEST - Run this cell to test your solution.
 
categoryRecsLen = len(goodrecs)
categoryRecsLenExpected = 11
 
assert categoryRecsLen == categoryRecsLenExpected, "Expected the total to be " + str(categoryRecsLenExpected) + " but found " + str(categoryRecsLen)

Spark Logo Tiny Step 5(a): Create a Class that encapsulates all good records and provides insights

  1. Create a class named CompanyPerformance
  2. The class constructor takes an argument of Array of CategoryQuarterlyRecord and stores in a class level variable.
  3. Implement a method getTotalRevenue, which returns total revenue generated by summing up the revenue from all the records.
  4. Implement a method getCategoryRevenue, which takes the name of the category and returns total revenue generated by the category.
# ANSWERS
 
from functools import reduce
 
class CompanyPerformance:
  
  # class level variable to store the good records
  CategoryRecords = None
  
  #defining constructor  
  def __init__(self, categoryRecs):
     self.CategoryRecords = categoryRecs
  
  # calculates and returns the total revenue across all quarters and categories
  def getTotalRevenue(self):
    
    # Extract revenue column and sum up
    allRevenues = list(map(lambda rec: rec.revenue, self.CategoryRecords))
    totalRevenue = reduce(lambda x, y: x + y, allRevenues)
    return round(totalRevenue, 2)
  
  # calculates and returns total revenue by the cateogry (passed as argument to the function)
  def getCategoryRevenue(self, _category):
    
    # filter records by category, extract revenue column and sum up
    categoryRecs = filter(lambda rec: rec.category == _category, self.CategoryRecords)
    allRevenues = list(map(lambda rec: rec.revenue, categoryRecs))
    totalRevenue = reduce(lambda x, y: x + y, allRevenues)
    return round(totalRevenue, 2)

Spark Logo Tiny Step 5(b): Initialize the class and print the insights

  1. Create an instance of class CompanyPerformance by passing categoryRecs in the constructor.
  2. Assig the instance to a variable named companyPerf2018
  3. Invoke getTotalRevenue and store the result in a variable named totalRevenue.
  4. Invoke getCategoryRevenue with argument Exercise_Fitness and store the result in a variable named totalRevenueInCategory.
# ANSWERS
 
companyPerf2018 = CompanyPerformance(categoryRecs)
totalRevenue = companyPerf2018.getTotalRevenue()
totalRevenueInCategory = companyPerf2018.getCategoryRevenue("Exercise_Fitness")
# TEST - Run this cell to test your solution.
 
totalRevenueExpected = 57.45
totalCategoryRevenueExpected = 24.98
 
assert totalRevenue == totalRevenueExpected, "Expected the total to be " + str(totalRevenueExpected) + " but found " + str(totalRevenue)
assert totalRevenueInCategory == totalCategoryRevenueExpected, "Expected the total to be " + str(totalCategoryRevenueExpected) + " but found " + str(totalRevenueInCategory)

Congratulation! You have completed your capstone project.